AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.32)

Neural Information Processing Systems - Feb-10-2026, 15:30:50 GMT

Emergent Communication

Recall that $\hat{m}_c(u)$ is exactly the listener's decoder in the IB framework (see Section 3.1.1). Therefore, any other decoder would yield an upper bound on the informativeness loss term. Notice that under our assumptions, $\hat{m}_c$ is a Gaussian mixture, whereas the speaker's beliefs are simply Gaussian. All the systems with the same $k$ form an equivalence class, and the canonical system within each class is the one with minimal $k$. These canonical systems are the natural ones to prefer, because they can attain the optimum for a given complexity with a minimal codebook.

architecture, artificial intelligence, machine learning, (17 more...)

Country:

Europe > France (0.05)
North America > United States (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.96)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.30)

Neural Information Processing Systems - Feb-8-2026, 20:13:29 GMT

6e01383fd96a17ae51cc3e15447e7533-AuthorFeedback.pdf

agent, architecture, experiment, (14 more...)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning (0.50)

Neural Information Processing Systems - Feb-8-2026, 05:36:43 GMT

324bb74b6d557428e21528379eeb7a0c-Supplemental-Conference.pdf

agent architecture, filter size 3, hyperparameter, (2 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.32)

Jobs, Niklas, da Silva, Luis Miguel Vieira, Somashekaraiah, Jayanth, Weigand, Maximilian, Kube, David, Gehlhoff, Felix

Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol

arXiv.org Artificial Intelligence - Dec-4-2025

Industrial automation increasingly requires flexible control strategies that can adapt to changing tasks and environments. Agents based on Large Language Models (LLMs) offer potential for such adaptive planning and execution but lack standardized benchmarks for systematic comparison. We introduce a benchmark with an executable simulation environment that represents the Blocksworld problem and provides five complexity categories. By integrating the Model Context Protocol (MCP) as a standardized tool interface, diverse agent architectures can be connected to and evaluated against the benchmark without implementation-specific modifications. A single-agent implementation demonstrates the benchmark's applicability, establishing quantitative metrics for comparison of LLM-based planning and execution approaches.

artificial intelligence, large language model, natural language, (18 more...)

2512.03955

Country: Europe > Germany (0.28)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Games (0.35)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Rahmani, Mahdi, Saffari, AmirHossein, Rahmani, Reyhane

MegaChat: A Synthetic Persian Q&A Dataset for High-Quality Sales Chatbot Evaluation

arXiv.org Artificial Intelligence - Dec-1-2025

Small and medium-sized enterprises (SMEs) in Iran increasingly leverage Telegram for sales, where real-time engagement is essential for conversion. However, developing AI-driven chatbots for this purpose requires large, high-quality question-and-answer (Q&A) datasets, which are typically expensive and resource-intensive to produce, especially for low-resource languages like Persian. In this paper, we introduce MegaChat, the first fully synthetic Persian Q&A dataset designed to evaluate intelligent sales chatbots in Telegram-based e-commerce. We propose a novel, automated multi-agent architecture that generates persona-aware Q&A pairs by collecting data from active Telegram shopping channels. The system employs specialized agents for question generation, validation, and refinement, ensuring the production of realistic and diverse conversational data. To evaluate answer generation, we compare three classic retrieval-augmented generation (RAG) models with our advanced agentic system, which features multi-query retrieval, reranking, and persona-aligned response synthesis. Using GPT-5.1 for evaluation across six quality dimensions, our results show that the agentic architecture outperformed traditional RAG models in 4 out of 5 diverse channels, demonstrating its ability to generate scalable, high-quality datasets without relying on expensive human annotation or complex fine-tuning. MegaChat provides SMEs with an efficient, cost-effective solution for building intelligent customer engagement systems in specialized commercial domains, enabling advancements in multilingual conversational AI for low-resource languages.

large language model, machine learning, natural language, (19 more...)

2511.23397

Country:

Asia > Middle East > Iran (0.36)
North America > United States (0.28)

Genre: Research Report > New Finding (0.86)

Industry: Information Technology > Services (0.36)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

AlShikh, Waseem, Ali, Muayad Sayed, Kennedy, Brian, Mozolevskyi, Dmytro

Towards Outcome-Oriented, Task-Agnostic Evaluation of AI Agents

arXiv.org Artificial Intelligence - Nov-12-2025

As AI agents proliferate across industries and applications, evaluating their performance based solely on infrastructural metrics such as latency, time-to-first-token, or token throughput is proving insufficient. These metrics fail to capture the quality of an agent's decisions, its operational autonomy, or its ultimate business value. This white paper proposes a novel, comprehensive framework of eleven outcome-based, task-agnostic performance metrics for AI agents that transcend domain boundaries. These metrics are designed to enable organizations to evaluate agents based on the quality of their decisions, their degree of autonomy, their adaptability to new challenges, and the tangible business value they deliver, regardless of the underlying model architecture or specific use case. We introduce metrics such as Goal Completion Rate (GCR), Autonomy Index (AIx), Multi-Step Task Resilience (MTR), and Business Impact Efficiency (BIE). Through a large-scale simulated experiment involving four distinct agent architectures (ReAct, Chain-of-Thought, Tool-Augmented, Hybrid) across five diverse domains (Healthcare, Finance, Marketing, Legal, and Customer Service), we demonstrate the framework's efficacy. Our results reveal significant performance trade-offs between different agent designs, highlighting the Hybrid Agent as the most consistently high-performing model across the majority of our proposed metrics, achieving an average Goal Completion Rate of 88.8% and the highest Return on Investment (ROI). This work provides a robust, standardized methodology for the holistic evaluation of AI agents, paving the way for more effective development, deployment, and governance.

agent, artificial intelligence, task-agnostic evaluation, (10 more...)

2511.08242

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (0.50)
Banking & Finance (0.47)
Law (0.36)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

arXiv.org Artificial Intelligence - Oct-27-2025

AgentArcEval: An Architecture Evaluation Method for Foundation Model based Agents

Lu, Qinghua, Zhao, Dehai, Liu, Yue, Zhang, Hao, Zhu, Liming, Xu, Xiwei, Shi, Angela, Tan, Tristan, Kazman, Rick

The emergence of foundation models (FMs) has enabled the development of highly capable and autonomous agents, unlocking new application opportunities across a wide range of domains. Evaluating the architecture of agents is particularly important as the architectural decisions significantly impact the quality attributes of agents given their unique characteristics, including compound architecture, autonomous and non-deterministic behaviour, and continuous evolution. However, traditional architecture evaluation methods fall short in addressing the evaluation needs of agent architecture because of these characteristics. Therefore, in this paper, we present AgentArcEval, a novel agent architecture evaluation method designed specifically to address the complexities of FM-based agent architecture and its evaluation. Moreover, we present a catalogue of agent-specific general scenarios, which serves as a guide for generating concrete scenarios to design and evaluate the agent architecture. We demonstrate the usefulness of AgentArcEval and the catalogue through a case study on the architecture evaluation of a real-world tax copilot, named Luna.

artificial intelligence, deep learning, machine learning, (19 more...)

doi: 10.1016/j.jss.2025.112656

2510.21031

Country:

Oceania > Australia (0.46)
North America > United States (0.28)

Genre:

Research Report (0.64)
Overview (0.46)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Government > Tax (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing Systems - Oct-3-2025, 04:47:55 GMT

6e01383fd96a17ae51cc3e15447e7533-AuthorFeedback.pdf

agent, architecture, artificial intelligence, (15 more...)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning (0.50)

arXiv.org Artificial Intelligence - Oct-3-2025

A cybersecurity AI agent selection and decision support framework

Malatji, Masike

This paper presents a novel, structured decision support framework that systematically aligns diverse artificial intelligence (AI) agent architectures (reactive, cognitive, hybrid, and learning) with the comprehensive National Institute of Standards and Technology (NIST) Cybersecurity Framework (CSF) 2.0. By integrating agent theory with industry guidelines, this framework provides a transparent and stepwise methodology for selecting and deploying AI solutions to address contemporary cyber threats. Employing a granular decomposition of NIST CSF 2.0 functions into specific tasks, the study links essential AI agent properties such as autonomy, adaptive learning, and real-time responsiveness to each subcategory's security requirements. In addition, it outlines graduated levels of autonomy (assisted, augmented, and fully autonomous) to accommodate organisations at varying stages of cybersecurity maturity. This holistic approach transcends isolated AI applications, providing a unified detection, incident response, and governance strategy. Through conceptual validation, the framework demonstrates how tailored AI agent deployments can align with real-world constraints and risk profiles, enhancing situational awareness, accelerating response times, and fortifying long-term resilience via adaptive risk management. Ultimately, this research bridges the gap between theoretical AI constructs and operational cybersecurity demands, establishing a foundation for robust, empirically validated multi-agent systems that adhere to industry standards.

artificial intelligence, deep learning, machine learning, (18 more...)